    Evaluierung und Erweiterung von MapReduce-Algorithmen zur Berechnung der transitiven Hülle ungerichteter Graphen für Entity Resolution Workflows

    In the field of entity resolution (deduplication), matching techniques are used to determine whether different records represent the same real-world object, since globally unique identifiers are missing. The inherent quadratic complexity leads to very long runtimes for large data volumes, which makes a parallelization of this process necessary. MapReduce, owing to its scalability and its applicability in cloud infrastructures, is a good solution for improving the runtime. In addition, under certain conditions the quality of the match result can be improved by computing the transitive closure.
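    Computing the transitive closure of an undirected match graph amounts to finding its connected components: if record A matches B and B matches C, all three fall into one duplicate cluster even though A and C were never compared directly. Below is a minimal single-machine sketch (in Python, with illustrative names; it is not the thesis' actual MapReduce implementation) of the iterative min-label propagation pattern that many MapReduce-based connected-components algorithms build on, where each loop iteration plays the role of one MapReduce round.

        # Sketch only: iterative min-label propagation for connected components.
        def connected_components(edges):
            """edges: iterable of (a, b) match pairs between record ids."""
            label = {}
            for a, b in edges:                  # every record starts as its own label
                label.setdefault(a, a)
                label.setdefault(b, b)
            changed = True
            while changed:                      # one "MapReduce round" per iteration
                changed = False
                for a, b in edges:              # both endpoints adopt the smaller label
                    m = min(label[a], label[b])
                    if label[a] != m or label[b] != m:
                        label[a] = label[b] = m
                        changed = True
            return label

        # Matches (1,2) and (2,3) transitively cluster 1, 2 and 3 together.
        print(connected_components([(1, 2), (2, 3), (4, 5)]))
        # -> {1: 1, 2: 1, 3: 1, 4: 4, 5: 4}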

    Privacy-Preserving Record Linkage for Big Data: Current Approaches and Research Challenges

    The growth of Big Data, especially personal data dispersed in multiple data sources, presents enormous opportunities and insights for businesses to explore and leverage the value of linked and integrated data. However, privacy concerns impede sharing or exchanging data for linkage across different organizations. Privacy-preserving record linkage (PPRL) aims to address this problem by identifying and linking records that correspond to the same real-world entity across several data sources held by different parties, without revealing any sensitive information about these entities. PPRL is increasingly being required in many real-world application areas; examples range from public health surveillance to crime and fraud detection and national security. PPRL for Big Data poses several challenges, the three major ones being (1) scalability to multiple large databases, due to their massive volume and the flow of data within Big Data applications, (2) achieving high-quality linkage results in the presence of the variety and veracity of Big Data, and (3) preserving the privacy and confidentiality of the entities represented in Big Data collections. In this chapter, we describe the challenges of PPRL in the context of Big Data, survey existing techniques for PPRL, and provide directions for future research.

    This work was partially funded by the Australian Research Council under Discovery Project DP130101801, by the German Academic Exchange Service (DAAD) and Universities Australia (UA) under the Joint Research Co-operation Scheme, and by the German Federal Ministry of Education and Research within the project Competence Center for Scalable Data Services and Solutions (ScaDS) Dresden/Leipzig (BMBF 01IS14014B).
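    One building block that PPRL surveys such as this chapter commonly cover is Bloom-filter encoding of quasi-identifiers: each party hashes the character q-grams of a field value into a bit vector, and similarity is then estimated on the bit vectors without exchanging plaintext. A minimal sketch; the filter size, hash count, and q-gram length are arbitrary illustrative parameters, not values from the chapter.

        # Sketch of Bloom-filter encoding for PPRL; SIZE, NUM_HASHES and Q
        # are illustrative assumptions, not parameters from the chapter.
        import hashlib

        SIZE, NUM_HASHES, Q = 256, 4, 2

        def qgrams(s, q=Q):
            s = f"_{s.lower()}_"                # pad so word boundaries contribute grams
            return {s[i:i + q] for i in range(len(s) - q + 1)}

        def bloom_encode(value):
            bits = [0] * SIZE
            for gram in qgrams(value):
                for seed in range(NUM_HASHES):  # k independent hash functions via seeds
                    h = hashlib.sha256(f"{seed}:{gram}".encode()).digest()
                    bits[int.from_bytes(h[:4], "big") % SIZE] = 1
            return bits

        def dice(a, b):
            # Dice coefficient on bit vectors: small typos leave most
            # q-grams, and hence most set bits, unchanged.
            common = sum(x & y for x, y in zip(a, b))
            return 2 * common / (sum(a) + sum(b))

        print(dice(bloom_encode("christine"), bloom_encode("cristine")))  # high
        print(dice(bloom_encode("christine"), bloom_encode("robert")))    # low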

    Optimization of the Mainzelliste software for fast privacy-preserving record linkage

    Background: Data analysis for biomedical research often requires a record linkage step to identify records from multiple data sources that refer to the same person. Due to the lack of unique personal identifiers across these sources, record linkage relies on the similarity of personal data such as first and last names or birth dates. However, the exchange of such identifying data with a third party, as is the case in record linkage, is generally subject to strict privacy requirements. This problem is addressed by privacy-preserving record linkage (PPRL) and pseudonymization services. Mainzelliste is an open-source record linkage and pseudonymization service used to carry out PPRL processes in real-world use cases.
    Methods: We evaluate the linkage quality and performance of the linkage process using several real and near-real datasets with different properties with respect to size and error rate of matching records. We compare (plaintext) record linkage with PPRL based on encoded records (Bloom filters). Furthermore, since the Mainzelliste software offers no blocking mechanism, we extend it with phonetic blocking as well as novel blocking schemes based on locality-sensitive hashing (LSH) to improve runtime for both standard and privacy-preserving record linkage.
    Results: Mainzelliste achieves high linkage quality for PPRL using field-level Bloom filters, owing to an error-tolerant matching algorithm that can handle variances in names, in particular missing or transposed name compounds. However, due to the absence of blocking, the runtimes are unacceptable for real use cases with larger datasets. The newly implemented blocking approaches improve runtimes by orders of magnitude while retaining high linkage quality.
    Conclusion: We conduct the first comprehensive evaluation of the record linkage facilities of the Mainzelliste software and extend it with blocking methods to improve its runtime. We observed very high linkage quality for both plaintext and encoded data, even in the presence of errors. The provided blocking methods yield order-of-magnitude runtime improvements, facilitating use in research projects with large datasets and many participants.
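    The LSH-blocking idea can be sketched as Hamming LSH over the Bloom-filter bit vectors: sample a few fixed bit positions per band, and only records that agree on all sampled bits of some band land in a common block and get compared. The band and sample-size parameters below are illustrative assumptions, not the configuration evaluated in the paper, and the function names are invented; the input could be any field-level Bloom-filter encoding such as the sketch above.

        # Sketch of Hamming-LSH blocking over Bloom-filter encodings;
        # num_bands and bits_per_band are illustrative assumptions.
        import random
        from collections import defaultdict

        def make_lsh_keys(num_bands=8, bits_per_band=12, size=256, seed=42):
            rng = random.Random(seed)           # all parties must agree on the positions
            return [rng.sample(range(size), bits_per_band) for _ in range(num_bands)]

        def candidate_pairs(encoded, lsh_keys):
            """encoded: dict record_id -> Bloom-filter bit list."""
            buckets = defaultdict(set)
            for rid, bits in encoded.items():
                for band_no, positions in enumerate(lsh_keys):
                    key = (band_no, tuple(bits[p] for p in positions))
                    buckets[key].add(rid)
            pairs = set()                       # pairs co-located in at least one bucket
            for ids in buckets.values():
                ids = sorted(ids)
                pairs.update((a, b) for i, a in enumerate(ids) for b in ids[i + 1:])
            return pairs

    With several independent bands, similar bit vectors collide in at least one band with high probability while most dissimilar pairs never meet, which is where order-of-magnitude gains over comparing all pairs come from.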

    Distributed privacy-preserving record linkage using pivot-based filter techniques

    Privacy-preserving record linkage (PPRL) aims at linking person-related records from different data sources while protecting privacy. It is applied in medical research to link health data without revealing sensitive person-related data. We propose and evaluate a new parallel PPRL approach based on Apache Flink that aims at high performance and scalability to large datasets. The approach supports a pivot-based filtering method for metric distance functions that saves many similarity computations. We describe our distributed approaches to determine pivots and to perform pivot-based linkage, and we demonstrate the high efficiency of the approach for different datasets and configurations.

    This work was partially funded by the German Academic Exchange Service (DAAD) and Universities Australia, as well as by the German Federal Ministry of Education and Research within the project Competence Center for Scalable Data Services and Solutions (ScaDS) Dresden/Leipzig (BMBF 01IS14014B), where this work was conducted.
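    Pivot-based filtering rests on the triangle inequality of metric distances: for any pivot p, |d(x, p) - d(y, p)| <= d(x, y), so distances precomputed to a pivot give a lower bound that rules out many candidate pairs without ever computing d(x, y). A minimal sequential sketch with edit distance as the metric; the function names are illustrative, and the paper's distributed Flink implementation and pivot-selection strategy are not reproduced here.

        # Sketch of pivot-based filtering via the triangle inequality.
        def edit_distance(a, b):
            # Standard dynamic-programming edit distance (a metric).
            prev = list(range(len(b) + 1))
            for i, ca in enumerate(a, 1):
                cur = [i]
                for j, cb in enumerate(b, 1):
                    cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                                   prev[j - 1] + (ca != cb)))
                prev = cur
            return prev[-1]

        def pivot_filtered_pairs(xs, ys, pivot, threshold):
            # One distance per record instead of |xs| * |ys| distances upfront.
            dx = {x: edit_distance(x, pivot) for x in xs}
            dy = {y: edit_distance(y, pivot) for y in ys}
            for x in xs:
                for y in ys:
                    if abs(dx[x] - dy[y]) > threshold:
                        continue                # lower bound already too large: skip
                    if edit_distance(x, y) <= threshold:
                        yield x, y

        print(list(pivot_filtered_pairs(["anna", "bernd"], ["ana", "berndt"],
                                        pivot="anna", threshold=1)))
        # -> [('anna', 'ana'), ('bernd', 'berndt')]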

    Record Linkage Evaluation and Optimization of the Mainzelliste Pseudonymization Service

    Synthetically generated person-related datasets used in the evaluation of linkage quality and runtime of the Mainzelliste. To generate person records, we used the established GeCo data generator, modified with small extensions such as look-up files for German names in addition to English names. A generated dataset consists of two subsets, org and dup, to be compared with each other. The duplicate records can contain data errors (e.g., different but similarly sounding letters, OCR errors, or typos) to simulate reduced data quality, making matching more challenging.
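    A hypothetical sketch of the kind of corruption step described; GeCo's actual corruption functions and look-up files are not reproduced here, and the confusion tables below are invented examples.

        # Hypothetical sketch: derive a "dup" record from an "org" record by
        # applying a typo, an OCR confusion, or a phonetic substitution.
        # The confusion tables are assumed examples, not GeCo's tables.
        import random

        OCR_CONFUSIONS = {"o": "0", "l": "1", "s": "5"}
        PHONETIC_SUBS = {"ph": "f", "th": "t", "ck": "k"}

        def corrupt(value, rng=random.Random(0)):
            kind = rng.choice(["typo", "ocr", "phonetic"])
            if kind == "typo" and len(value) > 1:
                i = rng.randrange(len(value) - 1)        # transpose two letters
                return value[:i] + value[i + 1] + value[i] + value[i + 2:]
            if kind == "ocr":
                for src, dst in OCR_CONFUSIONS.items():
                    if src in value:
                        return value.replace(src, dst, 1)
            for src, dst in PHONETIC_SUBS.items():       # phonetic fallback
                if src in value:
                    return value.replace(src, dst, 1)
            return value

        org = {"first_name": "stephan", "last_name": "mueller"}
        dup = {k: corrupt(v) for k, v in org.items()}
        print(org, dup, sep="\n")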